COBE: Contextualized Object Embeddings from Narrated Instructional Video (Supplementary Materials)
Dartmouth College
Our supplementary materials consist of: 1. Implementation Details. As before, the performance of each model variant is evaluated according to the standard mAP detection metric. The ablation studies are conducted on the test set of the HowTo100M_BB dataset. As expected, a larger number of negatives per single positive sample leads to better results.
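The "more negatives per positive" ablation refers to a contrastive (NCE-style) objective, where each query is scored against one matching and K non-matching embeddings. The sketch below is a generic illustration of such a loss, not the paper's implementation; the function name, shapes, and temperature value are assumptions for the example.

```python
import numpy as np

def info_nce_loss(query, positive, negatives, temperature=0.07):
    """InfoNCE-style loss with one positive and K negatives per query.

    query:     (D,) embedding of the anchor (e.g., an object region)
    positive:  (D,) embedding of the matching sample (e.g., a narration phrase)
    negatives: (K, D) embeddings of non-matching samples
    Returns a scalar loss value.
    """
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q, pos, negs = norm(query), norm(positive), norm(negatives)
    # Cosine similarities: positive first, then the K negatives.
    logits = np.concatenate([[q @ pos], negs @ q]) / temperature
    # Softmax cross-entropy with the positive at index 0.
    logits -= logits.max()  # numerical stability
    return -logits[0] + np.log(np.exp(logits).sum())
```

Increasing K adds more competing terms to the log-sum-exp denominator, which is one intuition for why more negatives per positive tends to help.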
NoteIt: A System Converting Instructional Videos to Interactable Notes Through Multimodal Video Understanding
Zhao, Running, Jiang, Zhihan, Zhang, Xinchen, Chang, Chirui, Chen, Handi, Deng, Weipeng, Jin, Luyao, Qi, Xiaojuan, Qian, Xun, Ngai, Edith C. H.
Users often take notes on instructional videos to access key knowledge later without revisiting long videos. Automated note-generation tools enable users to obtain informative notes efficiently. However, notes generated by existing research or off-the-shelf tools neither comprehensively preserve the information conveyed in the original videos nor satisfy users' expectations for diverse presentation formats and interactive features when using notes digitally. In this work, we present NoteIt, a system that automatically converts instructional videos into interactable notes using a novel pipeline that faithfully extracts hierarchical structure and multimodal key information from videos. With NoteIt's interface, users can interact with the system to further customize the content and presentation formats of the notes according to their preferences. We conducted both a technical evaluation and a comparison user study (N=36). The solid performance on objective metrics and the positive user feedback demonstrated the effectiveness of the pipeline and the overall usability of NoteIt. Project website: https://zhaorunning.github.io/NoteIt/
MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos
Zang, Yuan, Tan, Hao, Yoon, Seunghyun, Dernoncourt, Franck, Gu, Jiuxiang, Kafle, Kushal, Sun, Chen, Bui, Trung
We study multi-modal summarization for instructional videos, whose goal is to provide users with an efficient way to learn skills in the form of text instructions and key video frames. We observe that existing benchmarks focus on generic semantic-level video summarization and are not suitable for providing step-by-step executable instructions and illustrations, both of which are crucial for instructional videos. To fill this gap, we propose a novel benchmark for user interface (UI) instructional video summarization. We collect a dataset of 2,413 UI instructional videos spanning over 167 hours. These videos are manually annotated for video segmentation, text summarization, and video summarization, which enables comprehensive evaluation of concise and executable video summarization. We conduct extensive experiments on our collected MS4UI dataset, which suggest that state-of-the-art multi-modal summarization methods struggle on UI video summarization, and highlight the need for new methods for UI instructional video summarization.
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks. Existing task formulations primarily focus on simple tasks that can be specified by a single, language-only instruction, such as "Insert a new slide." In this work, we introduce VideoGUI, a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks. Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software (e.g., Adobe Photoshop or Stable Diffusion WebUI) and complex activities (e.g., video editing). VideoGUI evaluates GUI assistants through a hierarchical process, allowing for identification of the specific levels at which they may fail: (i) high-level planning: reconstruct procedural subtasks from visual conditions without language descriptions; (ii) middle-level planning: generate sequences of precise action narrations based on visual state (i.e., screenshot) and goals; (iii) atomic action execution: perform specific actions such as accurately clicking designated elements.
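The hierarchical evaluation idea, scoring an assistant separately at the planning, narration, and execution levels so failures can be localized, can be sketched as follows. The field names and aggregation are illustrative assumptions, not the benchmark's actual schema or scoring code.

```python
from dataclasses import dataclass

@dataclass
class HierarchicalResult:
    """Per-task outcomes at each level of the hierarchy (hypothetical schema)."""
    high_level_plan_ok: bool    # recovered the right subtasks from visuals alone
    mid_level_actions_ok: bool  # produced correct action narrations
    atomic_actions_ok: bool     # executed clicks/inputs accurately

def level_accuracies(results):
    """Aggregate success rate at each level, so one can see where models fail."""
    n = len(results)
    return {
        "high": sum(r.high_level_plan_ok for r in results) / n,
        "mid": sum(r.mid_level_actions_ok for r in results) / n,
        "atomic": sum(r.atomic_actions_ok for r in results) / n,
    }
```

Reporting per-level rates rather than a single end-to-end score is what lets the benchmark pinpoint whether a model fails at planning or at execution.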
Overview of the NLPCC 2025 Shared Task 4: Multi-modal, Multilingual, and Multi-hop Medical Instructional Video Question Answering Challenge
Li, Bin, Liu, Shenxi, Weng, Yixuan, Du, Yue, Tian, Yuhang, Zhou, Shoujun
Following the successful hosting of the 1st CMIVQA challenge (NLPCC 2023, Foshan) and the 2nd MMIVQA challenge (NLPCC 2024, Hangzhou), this year a new task has been introduced to further advance research in multi-modal, multilingual, and multi-hop medical instructional video question answering (M4IVQA) systems, with a specific focus on medical instructional videos. The M4IVQA challenge focuses on evaluating models that integrate information from medical instructional videos, understand multiple languages, and answer multi-hop questions requiring reasoning over various modalities. This task consists of three tracks: multi-modal, multilingual, and multi-hop Temporal Answer Grounding in Single Video (M4TAGSV); multi-modal, multilingual, and multi-hop Video Corpus Retrieval (M4VCR); and multi-modal, multilingual, and multi-hop Temporal Answer Grounding in Video Corpus (M4TAGVC). Participants in M4IVQA are expected to develop algorithms capable of processing both video and text data, understanding multilingual queries, and providing relevant answers to multi-hop medical questions. We believe the newly introduced M4IVQA challenge will drive innovations in multimodal reasoning systems for healthcare scenarios, ultimately contributing to smarter emergency response systems and more effective medical education platforms in multilingual communities. Our official website is https://cmivqa.github.io/
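Temporal answer grounding asks a model to predict the (start, end) span of a video that answers a question, and such predictions are commonly scored by temporal intersection-over-union. The function below is a generic illustration of that standard metric, not the challenge's official scoring code.

```python
def temporal_iou(pred, gold):
    """Temporal IoU between two (start, end) spans, e.g. in seconds.

    pred, gold: (start, end) tuples with start <= end.
    Returns a value in [0, 1]; 1.0 means the spans coincide exactly.
    """
    (ps, pe), (gs, ge) = pred, gold
    inter = max(0.0, min(pe, ge) - max(ps, gs))       # overlap length
    union = (pe - ps) + (ge - gs) - inter             # combined length
    return inter / union if union > 0 else 0.0
```

A typical evaluation then reports the fraction of questions whose predicted span reaches an IoU threshold (e.g. 0.3, 0.5, 0.7) against the annotated answer span.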
Review for NeurIPS paper: COBE: Contextualized Object Embeddings from Narrated Instructional Video
While this algorithm is specifically designed for detectors, Miech et al. (2019) used unsupervised NCE losses (much like the ones in this paper) to understand the natural language descriptions associated with videos; the algorithm presented here seems like the most straightforward extension of this idea to bounding boxes. Little attention is given to demonstrating that the use of bounding boxes fundamentally changes the problem. Update: The rebuttal addresses my concern regarding the accuracy of the evaluation. I had misunderstood the annotations that are available with EPIC-Kitchens, and therefore I am changing my review. I would encourage the authors to clarify the writing regarding what is available with EPIC-Kitchens.
Video-Mined Task Graphs for Keystep Recognition in Instructional Videos
Procedural activity understanding requires perceiving human actions in terms of a broader task, where multiple keysteps are performed in sequence across a long video to reach a final goal state---such as the steps of a recipe or the steps of a DIY fix-it task. Prior work largely treats keystep recognition in isolation from this broader structure, or else rigidly confines keysteps to align with a particular sequential script. We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps, then leverage this graph to regularize keystep recognition in novel videos. On multiple datasets of real-world instructional video, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art.
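The core idea, estimating how people tend to move between keysteps and then using those transition probabilities to regularize per-clip predictions, can be sketched in a few lines. This is a minimal illustration of the concept, assuming keysteps are given as label sequences; it is not the paper's actual method, and the blending weight is a hypothetical parameter.

```python
from collections import defaultdict

def build_task_graph(sequences):
    """Estimate keystep transition probabilities P(next | current) from
    observed keystep sequences (lists of keystep labels)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def rescore(prev_step, step_scores, graph, alpha=0.5):
    """Blend per-clip recognition scores with graph transition priors,
    so predictions consistent with typical task structure are favored."""
    return {s: (1 - alpha) * p + alpha * graph.get(prev_step, {}).get(s, 0.0)
            for s, p in step_scores.items()}
```

With a graph mined from many how-to videos, a keystep that the classifier scores slightly lower can still win if it is the far more likely successor of the previous step.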